1. Introduction

To remove inconsistencies, reduce overhead and enable important functionalities, a revision of the syntax and
semantics of the current MPEG-4 Video Verification Model (VM) is proposed. The revised video syntax included
in this document supports all of the current features in VM2.2 and, in addition, enables functionalities such
as scalability and provides flexibilities that may be useful for error resilience and multi-viewpoint coding. The
proposed syntax consists of the following class hierarchy:

• VideoSession (VS)

• VideoObject (VO)

• VideoObjectLayer (VOL)

• VideoObjectPlane (VOP)

Within the context of video experiments, a VS is a collection of one or more VOs, a VO consists of one layer
(nonscalable coding) or more layers (scalable coding), and each layer consists of an ordered sequence of
snapshots in time called VOPs. Thus there can be several VOs (VO0, VO1, ...) in a VS; for each VO there can
be several scalability layers (VOL0, VOL1, ...); and each scalability layer consists of a time sequence of VOPs
(VOP0, VOP1, ...). A VO can be of arbitrary shape (rectangular is a special case). For
nonscalable coding only one VOL (VOL0) exists per VO. In scalable coding VOL0 would be the base layer,
VOL1 the first enhancement layer, and so forth. Figure 1 shows the hierarchical structure of the proposed syntax.
Figure 1: Hierarchy in the proposed video syntax
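
As a rough illustration of the hierarchy, the containment relations can be sketched as Python data classes. The class names mirror the four syntax classes, but the Python representation itself, including the field names and the is_scalable helper, is an assumption made for illustration and is not part of the proposed syntax.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoObjectPlane:
    temp_ref: int  # position of this snapshot in time (illustrative field)


@dataclass
class VideoObjectLayer:
    layer_id: int  # 0 is the base layer, 1 the first enhancement layer
    planes: List[VideoObjectPlane] = field(default_factory=list)


@dataclass
class VideoObject:
    object_id: int
    layers: List[VideoObjectLayer] = field(default_factory=list)

    @property
    def is_scalable(self) -> bool:
        # More than one layer means scalable coding is in use.
        return len(self.layers) > 1


@dataclass
class VideoSession:
    objects: List[VideoObject] = field(default_factory=list)


# One nonscalable object (VOL0 only) and one object with a base layer
# and a single enhancement layer.
session = VideoSession(objects=[
    VideoObject(0, [VideoObjectLayer(0, [VideoObjectPlane(0), VideoObjectPlane(1)])]),
    VideoObject(1, [
        VideoObjectLayer(0, [VideoObjectPlane(0)]),
        VideoObjectLayer(1, [VideoObjectPlane(0), VideoObjectPlane(1)]),
    ]),
])
```

The nesting of lists follows the text directly: a session holds objects, an object holds layers, and a layer holds a time-ordered sequence of planes.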

In Section 2 we provide the complete syntax and semantics including that for generalized scalability which is 
based on MPEG-2 Temporal Scalability syntax and is extended to provide Object-based Temporal Scalability. In 
Section 3 we provide a description of the generalized scalability; Section 4 provides a summary of this document. 

2. Syntax and Semantics 



2.1 VideoSession 



Syntax                                                      No. of bits    Mnemonic

VideoSession() {
    video_session_start_code                                32
    do {
        do {
            VideoObject()
        } while (nextbits() == video_object_start_code)
        if (nextbits() != video_session_end_code)
            video_session_start_code                        32
    } while (nextbits() != video_session_end_code)
    video_session_end_code                                  32
}




2.2 VideoObject


Syntax                                                      No. of bits    Mnemonic

VideoObject() {
    video_object_start_code                                 24+3
    object_id                                               5
    do {
        VideoObjectLayer()
    } while (nextbits() == video_object_layer_start_code)
    next_start_code()
}




object_id

It uniquely identifies a video object. It is a 5-bit quantity with values from 0 to 31.






2.3 VideoObjectLayer


Syntax                                                      No. of bits    Mnemonic

VideoObjectLayer() {
    video_object_layer_start_code                           28
    layer_id                                                4
    layer_width                                             10
    layer_height                                            10
    quant_type_sel                                          1
    if (quant_type_sel) {
        load_intra_quant_mat                                1
        if (load_intra_quant_mat)
            intra_quant_mat[64]                             8*64
        load_nonintra_quant_mat                             1
        if (load_nonintra_quant_mat)
            nonintra_quant_mat[64]                          8*64
    }
    intra_dcpred_disable                                    1
    scalability                                             1
    if (scalability) {
        ref_layer_id                                        4
        ref_layer_sampling_direc                            1
        hor_sampling_factor_n                               5
        hor_sampling_factor_m                               5
        vert_sampling_factor_n                              5
        vert_sampling_factor_m                              5
        enhancement_type                                    1
    }
    do {
        VideoObjectPlane()
    } while (nextbits() == video_object_plane_start_code)
    next_start_code()
}





layer_id

It uniquely identifies a layer. It is a 4-bit quantity with values from 0 to 15. A value of 0 identifies the first
independently coded layer.

layer_width, layer_height

These values define the spatial resolution of a layer in pixel units.

scalability

This is a 1-bit flag which indicates whether scalability is used for coding of the current layer.

ref_layer_id

It uniquely identifies a decoded layer to be used as a reference for predictions in the case of scalability. It is a
4-bit quantity with values from 0 to 15.

ref_layer_sampling_direc

This is a 1-bit flag. A value of "0" indicates that the reference layer specified by ref_layer_id has the same
or lower resolution than the layer being coded. A value of "1" indicates that the resolution of the
reference layer is higher than the resolution of the layer being coded.

hor_sampling_factor_n, hor_sampling_factor_m

These are 5-bit quantities in the range 1 to 31 whose ratio hor_sampling_factor_n/hor_sampling_factor_m indicates
the resampling needed in the horizontal direction; the direction of sampling is indicated by ref_layer_sampling_direc.

vert_sampling_factor_n, vert_sampling_factor_m

These are 5-bit quantities in the range 1 to 31 whose ratio vert_sampling_factor_n/vert_sampling_factor_m
indicates the resampling needed in the vertical direction; the direction of sampling is indicated by
ref_layer_sampling_direc.
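
The following Python sketch shows one way the sampling factors could be applied to a reference-layer dimension. The text does not state exactly how the ratio n/m combines with ref_layer_sampling_direc, so the interpretation here (scale the reference up by n/m when the flag is 0, down by n/m when it is 1) is an assumption, as are the function name resampled_size and the truncating conversion to an integer.

```python
from fractions import Fraction


def resampled_size(ref_size: int, factor_n: int, factor_m: int,
                   sampling_direc: int) -> int:
    """Return a reference-layer dimension after resampling.

    Assumption (not stated in the text): with sampling_direc == 0 the
    reference is scaled up by n/m, and with sampling_direc == 1 it is
    scaled down by the same ratio.
    """
    for factor in (factor_n, factor_m):
        if not 1 <= factor <= 31:  # 5-bit quantities restricted to 1..31
            raise ValueError("sampling factors must be in the range 1..31")
    ratio = Fraction(factor_n, factor_m)
    scaled = ref_size * ratio if sampling_direc == 0 else ref_size / ratio
    return int(scaled)  # truncation is an illustrative choice


# A 176-pixel-wide reference upsampled 2:1, and a 352-pixel-wide
# reference downsampled 2:1.
up = resampled_size(176, 2, 1, 0)
down = resampled_size(352, 2, 1, 1)
```

Using exact fractions avoids floating-point rounding for ratios such as 3/2.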



enhancement_type

This is a 1-bit flag which indicates the type of enhancement structure used in scalability. It has a value of "1"
when an enhancement layer enhances a partial region of the base layer. It has a value of "0" when an
enhancement layer enhances the entire region of the base layer. The default value of this flag is "0".



Other syntax elements such as quant_type_sel and intra_dcpred_disable in the VideoObjectLayer have the
same meaning as described in the VM.
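
To make the VideoObjectLayer header layout concrete, here is a minimal Python sketch of a parser that reads the header fields with the bit widths listed in the syntax table. The BitReader class, the parse_vol_header and pack_bits helpers, and the dictionary output are all hypothetical names introduced for illustration. The start code value is not given in this document, so the sketch reads the 28 bits without checking them, and it stops before the VideoObjectPlane loop.

```python
class BitReader:
    """Reads unsigned big-endian bit fields from a bytes object."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current bit position

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse_vol_header(r: BitReader) -> dict:
    """Parse VideoObjectLayer header fields using the tabulated bit widths."""
    h = {}
    h["start_code"] = r.read(28)  # value not checked: not given in the text
    h["layer_id"] = r.read(4)
    h["layer_width"] = r.read(10)
    h["layer_height"] = r.read(10)
    h["quant_type_sel"] = r.read(1)
    if h["quant_type_sel"]:
        if r.read(1):  # load_intra_quant_mat
            h["intra_quant_mat"] = [r.read(8) for _ in range(64)]
        if r.read(1):  # load_nonintra_quant_mat
            h["nonintra_quant_mat"] = [r.read(8) for _ in range(64)]
    h["intra_dcpred_disable"] = r.read(1)
    h["scalability"] = r.read(1)
    if h["scalability"]:
        h["ref_layer_id"] = r.read(4)
        h["ref_layer_sampling_direc"] = r.read(1)
        h["hor_sampling_factor_n"] = r.read(5)
        h["hor_sampling_factor_m"] = r.read(5)
        h["vert_sampling_factor_n"] = r.read(5)
        h["vert_sampling_factor_m"] = r.read(5)
        h["enhancement_type"] = r.read(1)
    return h


def pack_bits(fields) -> bytes:
    """Helper for the example: pack (value, width) pairs into bytes."""
    bits = "".join(format(v, "0{}b".format(w)) for v, w in fields)
    bits += "0" * (-len(bits) % 8)  # pad to a whole number of bytes
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


# Enhancement layer 1, 352x288, predicted from layer 0 with 2:1 sampling.
example = pack_bits([(0, 28), (1, 4), (352, 10), (288, 10), (0, 1),
                     (0, 1), (1, 1), (0, 4), (0, 1), (2, 5), (1, 5),
                     (2, 5), (1, 5), (0, 1)])
header = parse_vol_header(BitReader(example))
```

The scalability-related fields are only present in the dictionary when the scalability flag is set, which mirrors the conditional structure of the syntax.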



2.4 Video Object Plane 





Syntax                                                      No. of bits    Mnemonic

VideoObjectPlane() {
    video_object_plane_start_code                           32
    plane_temp_ref                                          16
    plane_visibility                                        1
    plane_of_arbitrary_shape                                1
    if (plane_of_arbitrary_shape) {
        plane_width                                         10
        plane_height                                        10
        if (plane_visibility) {
            plane_composition_order                         5
            plane_hor_spatial_ref                           10
            marker_bit                                      1
            plane_vert_spatial_ref                          10
            plane_scaling                                   1
            if (scalability && enhancement_type)
                background_composition                      1
        }
        shape()
    }
    plane_coding_type                                       2
    if (plane_coding_type == 1 || plane_coding_type == 2) {
        plane_fcode_for                                     2
        if (plane_coding_type == 2) {
            plane_fcode_back                                2
            plane_dbquant                                   2
        }
        else {
            plane_quant                                     5
        }
    }
    if (!scalability) {
        separate_motion_texture                             1
        if (!separate_motion_texture)
            combined_motion_texture_coding()
        else {
            motion_coding()
            texture_coding()
        }
    }
    else {
        if (background_composition) {
            load_backward_shape                             1
            if (load_backward_shape) {
                backward_shape()
                load_forward_shape                          1
                if (load_forward_shape)
                    forward_shape()
            }
        }
        ref_select_code                                     2
        if (plane_coding_type == 1 || plane_coding_type == 2) {
            forward_temporal_ref                            10
            if (plane_coding_type == 2) {
                marker_bit                                  1
                backward_temporal_ref                       10
            }
        }
        combined_motion_texture_coding()
    }
}



background_composition

This flag only occurs when the scalability flag has a value of "1". The default value of this flag is "0". This flag is
used in conjunction with the enhancement_type flag. If enhancement_type is "1" and this flag is "1", background
composition is performed. If enhancement_type is "1" and this flag is "0", the background is repeated from the nearest
frame in the base layer. If enhancement_type is "0", no action needs to be taken as a consequence of any
value of this flag.
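
The three cases above can be summarized in a small Python sketch. The function name background_action and the returned labels are hypothetical and serve only to make the decision explicit; they are not syntax elements.

```python
def background_action(enhancement_type: int, background_composition: int = 0) -> str:
    """Return the background handling implied by the two flags.

    background_composition defaults to 0, matching the stated default.
    """
    if enhancement_type == 0:
        # The whole base-layer region is enhanced; the flag has no effect.
        return "none"
    if background_composition == 1:
        return "compose_background"
    return "repeat_nearest_base_frame"
```

For example, background_action(1, 0) selects repetition of the nearest base-layer frame, while background_action(0, 1) requires no action.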

shape()



The shape() function specifies the format of the coded data of a current shape (alpha plane).


Syntax                                                      No. of bits    Mnemonic

shape() {
    binary_shape                                            1
    if (binary_shape) {
        do {
            first_QT_code                                   1-2
            if (first_QT_code == "00")
                subsequent_QT_codes
        } while (count of macroblock != total number of macroblocks)
    } else {
        do {
            first_QT_code
            if (first_QT_code == "00") {
                subsequent_QT_codes
                VQ_codes                                    0-128
            }
        } while (count of macroblock != total number of macroblocks)
    }
}





load_backward_shape 

If this flag is "1", backward_shape of the previous VOP is copied to forward_shape for the current VOP and
backward_shape for the current VOP is decoded from the bitstream. If not, forward_shape for the previous VOP is
copied to forward_shape for the current VOP and backward_shape for the previous VOP is copied to
backward_shape for the current VOP.
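
The copy rules for load_backward_shape can be sketched in Python as follows. The function update_shapes and its decode_backward_shape callback are hypothetical names; shapes are treated as opaque values, and the later effect of load_forward_shape (which may replace the forward shape with one decoded from the bitstream) is deliberately left out.

```python
from typing import Callable, Tuple, TypeVar

Shape = TypeVar("Shape")


def update_shapes(prev_forward: Shape, prev_backward: Shape,
                  load_backward_shape: int,
                  decode_backward_shape: Callable[[], Shape]) -> Tuple[Shape, Shape]:
    """Return (forward_shape, backward_shape) for the current VOP."""
    if load_backward_shape:
        # The previous backward shape becomes the current forward shape,
        # and a fresh backward shape is decoded from the bitstream.
        return prev_backward, decode_backward_shape()
    # Otherwise both shapes carry over unchanged from the previous VOP.
    return prev_forward, prev_backward
```

With placeholder strings standing in for shapes, update_shapes("F0", "B0", 1, lambda: "B1") yields ("B0", "B1"), and the same call with the flag set to 0 yields ("F0", "B0").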

backward_shape()

It specifies the format of coded data for backward_shape and is identical to that of shape(). The boundary
rectangle of backward_shape() is the same as the entire image.

load_forward_shape

This flag is "1" if forward_shape will be decoded from the bitstream.

forward_shape()

It specifies the format of coded data for forward_shape and is identical to that of shape(). The boundary rectangle
of forward_shape() is the same as the entire image.

ref_select_code

This is a 2-bit code which indicates prediction reference choices for P- and B-VOPs in the enhancement layer with
respect to the decoded reference layer identified by ref_layer_id.

forward_temporal_ref

An unsigned integer value which indicates the temporal reference of the decoded reference layer VOP to be used for
forward prediction (Table 1 and Table 2).

backward_temporal_ref

An unsigned integer value which indicates the temporal reference of the decoded reference layer VOP to be used for
backward prediction (Table 2).

3. Generalized Scalability 

Generalized scalability involves more than one layer in VideoObjectLayer. Considering the case of two layers, a
lower layer and an enhancement layer, the spatial resolution of each layer may be either the same or different;
when the layers have different spatial resolutions, up or down sampling of the lower layer with respect to the
enhancement layer becomes necessary for generating predictions. If the lower layer and the enhancement layer
are temporally offset, irrespective of the spatial resolutions, motion compensated prediction may be used
between layers. When the layers are coincident in time but at different resolutions, motion compensation may be
switched off to reduce overhead.

The reference VOPs for prediction are selected by ref_select_code as described in Tables 1 and 2. In
coding P-VOPs belonging to an enhancement layer, the forward reference can be one of the following three: the
most recent decoded VOP of the enhancement layer, the most recent VOP of the lower layer in display order, or the
next VOP of the lower layer in display order.

In B-VOPs, the forward reference can be one of the two: the most recent decoded enhancement VOP or the most 
recent lower layer VOP in display order. The backward reference can be one of the three: the temporally 
coincident VOP in the lower layer, the most recent lower layer VOP in display order, or the next lower layer VOP in 
display order. 

Table 1: Prediction reference choices for P-VOPs in the object-based temporal scalability.

ref_select_code    forward prediction reference

00                 Most recent decoded enhancement VOP belonging to the same layer.
01                 Most recent VOP in display order belonging to the reference layer.
10                 Next VOP in display order belonging to the reference layer.
11                 Temporally coincident VOP in the reference layer (no motion vectors).



Table 2: Prediction reference choices for B-VOPs in the case of scalability.

ref_select_code    forward temporal reference                backward temporal reference

00                 Most recent decoded enhancement VOP       Temporally coincident VOP in the
                   of the same layer.                        reference layer (no motion vectors).

01                 Most recent decoded enhancement VOP       Most recent VOP in display order
                   of the same layer.                        belonging to the reference layer.

10                 Most recent decoded enhancement VOP       Next VOP in display order belonging to
                   of the same layer.                        the reference layer.

11                 Most recent VOP in display order          Next VOP in display order belonging to
                   belonging to the reference layer.         the reference layer.
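
Tables 1 and 2 can be expressed as simple lookup tables. The Python sketch below encodes them directly; the Ref enumeration and the function names forward_reference_p and references_b are hypothetical and chosen only for this illustration.

```python
from enum import Enum
from typing import Tuple


class Ref(Enum):
    SAME_LAYER_RECENT = "most recent decoded enhancement VOP of the same layer"
    REF_LAYER_RECENT = "most recent VOP in display order in the reference layer"
    REF_LAYER_NEXT = "next VOP in display order in the reference layer"
    REF_LAYER_COINCIDENT = "temporally coincident VOP in the reference layer (no motion vectors)"


# Table 1: ref_select_code -> forward reference for P-VOPs.
P_VOP_TABLE = {
    0b00: Ref.SAME_LAYER_RECENT,
    0b01: Ref.REF_LAYER_RECENT,
    0b10: Ref.REF_LAYER_NEXT,
    0b11: Ref.REF_LAYER_COINCIDENT,
}

# Table 2: ref_select_code -> (forward, backward) references for B-VOPs.
B_VOP_TABLE = {
    0b00: (Ref.SAME_LAYER_RECENT, Ref.REF_LAYER_COINCIDENT),
    0b01: (Ref.SAME_LAYER_RECENT, Ref.REF_LAYER_RECENT),
    0b10: (Ref.SAME_LAYER_RECENT, Ref.REF_LAYER_NEXT),
    0b11: (Ref.REF_LAYER_RECENT, Ref.REF_LAYER_NEXT),
}


def forward_reference_p(ref_select_code: int) -> Ref:
    """Forward prediction reference for a P-VOP (Table 1)."""
    return P_VOP_TABLE[ref_select_code]


def references_b(ref_select_code: int) -> Tuple[Ref, Ref]:
    """(forward, backward) prediction references for a B-VOP (Table 2)."""
    return B_VOP_TABLE[ref_select_code]
```

A decoder would consult the table matching the VOP type after reading the 2-bit ref_select_code.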



The enhancement layer can contain I, P or B-VOPs, but the B-VOPs in the enhancement layer behave more like P- 
VOPs at least in the sense that a decoded B-VOP can be used to predict the following P or B-VOPs. 

When the most recent VOP in the lower layer is used as reference, this includes the VOP that is temporally 
coincident with the VOP in the enhancement layer. However, this necessitates use of lower layer for motion 
compensation which requires motion vectors. 

If the coincident VOP in the lower layer is used explicitly as reference, no motion vectors are sent and this mode 
can be used to provide spatial scalability. Spatial scalability in MPEG-2 uses spatio-temporal prediction, which is 
accomplished here by using the prediction modes available for B-VOPs. 

Since the VOPs can have a rectangular shape (picture) or an irregular shape, both the traditional as well as object 
based temporal and spatial scalabilities become possible. 

We next explain the meaning of the enhancement_type flag in more detail. As an example, Figure 2 shows an entire
image containing several types of regions, for example a road, a car, and mountains. Both the base layer with
enhancement_type being "0" and the base layer with enhancement_type being "1" are coded with lower picture
quality, which means that either the frame rate is lower or the spatial resolution is lower. At the enhancement layer
of the scalability, the enhancement_type flag distinguishes the following two cases.

• When this flag is "1", the enhancement layer increases the picture quality of a partial region of the base layer.
For example, in Figure 2, VO0 is an entire frame and VO1 is the car in the frame. The temporal resolution or the
spatial resolution of the car is enhanced.

• When this flag is "0", the enhancement layer increases the picture quality of the entire region of the base
layer. For example, in Figure 2, if VO0 represents an entire frame, VO1 is also the entire frame. Then the
temporal or spatial resolution of the entire frame is enhanced. If VO0 represents the car, VO1 is also the car, which
is enhanced in terms of temporal or spatial resolution.

4. Summary 

A new syntax and clear semantics are proposed. The syntax class hierarchy consists of the following:

• VideoSession (VS) 

• VideoObject (VO)

• VideoObjectLayer (VOL)

• VideoObjectPlane (VOP) 

This syntax not only supports all features of the current VM but also enables important functionalities such as
object-based scalability. For nonscalable coding, the overhead is reduced by moving parameters that do not change
from the VOP level to the VOL level, which occurs less frequently. Scalability is introduced in a structured manner.
Because the proposed scalability syntax is a simplification of the MPEG-2 scalability syntax with the minimal
extensions necessary to enable object scalability, it is efficient. In addition to scalability, the flexibilities offered by
the syntax are expected to be useful for error resilience and multi-viewpoint functionalities.

In addition, issues in generalized scalability including how predictions are formed are explained in detail. 
Traditional spatial and temporal scalabilities suitable for the lower bitrates MPEG-4 is addressing are derived as a 
subset of the generalized scalability syntax. Scalability on arbitrary shaped objects as well as rectangular (picture) 
objects is also supported by the generalized scalability. 
Figure 2: Example of a region to be enhanced. (Legend: region to be enhanced by an enhancement layer.)



